AIML Classification Project Personal Loan Modeling

author: Aidos Utegulov Feb 21 cohort

Description

Background and Context

AllLife Bank is a US bank with a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and, in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective

To predict whether a liability customer will buy a personal loan, to identify which variables are most significant, and to determine which segment of customers should be targeted more.

Data Dictionary

Import libraries

Read the data

Dataset summary

Observations and Insights

Drop ID column

EDA

I'm using a function taken from the MLS session notebook that plots a histogram and a boxplot of a feature:

Univariate Analysis

Observations on Age

Observations on average credit card spending (CCAvg)

Observations on Income

Observations on Education

Observations on Experience

Observations on Family

Observations on Mortgage

Observations on ZIPCode

Countplots of categorical dummy variables:

Bivariate Analysis

Identify Correlation in data

Heatmap of correlations

Observations and Insights

Pair plots of all columns

Observation: There is strong collinearity between the variables Age and Experience, which makes sense, since people gain more experience with age. There is also some linear relationship between the variables Income and CCAvg.

Scatterplots of Age vs Income and Income vs CCAvg

Observations: As Income grows, so does the number of people who take personal loans; there is an evident cutoff point close to $100K, after which many more people seem to have taken personal loans.

Generating stacked plots of the most interesting features according to how many people took a Personal Loan

Observations and Insights

This is a good time to make a copy of the data for the decision tree analysis

Feature Engineering

Negative values in Experience

I got rid of the negative values by simply multiplying them by -1 (i.e., taking the absolute value)
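A minimal sketch of this fix, assuming the data lives in a DataFrame `df` with an `Experience` column (toy values here, not the real data):

```python
import pandas as pd

df = pd.DataFrame({"Experience": [5, -2, 20, -1, 0]})  # toy stand-in data

# Negative experience values are data-entry errors; flip their sign.
df["Experience"] = df["Experience"].abs()
```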

Outlier treatment

From the univariate analysis section, it was clear that two columns, Income and Mortgage, have a significant number of outliers. We shall deal with those outliers by capping them at the upper-whisker value (Q3 + 1.5 × IQR).

Check to see if treatment worked on columns with outliers:

We have successfully treated the outliers in the two columns of interest

Converting ZIP code into categorical column

Install pyzipcode to get city and state from a zip code

Observations: some of the zip codes could not be identified by pyzipcode, so I replaced them with the string 'Unknown', which will later be treated as a separate category
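The pattern looks roughly like this; for a self-contained illustration I use a small hand-written lookup table standing in for pyzipcode's ZipCodeDatabase, so the helper and its mapping are mine, not the notebook's:

```python
import pandas as pd

# Toy lookup table standing in for pyzipcode's zip-code database.
ZIP_TO_STATE = {"90210": "CA", "94305": "CA", "10001": "NY"}

def zip_to_state(zip_code):
    """Return the state for a zip code, or 'Unknown' if it cannot be resolved."""
    return ZIP_TO_STATE.get(str(zip_code), "Unknown")

df = pd.DataFrame({"ZIPCode": ["90210", "10001", "99999"]})
df["State"] = df["ZIPCode"].map(zip_to_state).astype("category")
```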

Now we can safely drop the ZIPCode column

Observations:

Logistic Regression Model

Now that the dataset has been pre-processed we can begin building the model. I'll start by defining two useful functions: one to obtain the metrics for the model and another one to build the confusion matrix. Both these functions have been provided in the lectures and mentored sessions.

Split Data

I will split the data with a 70/30 train/test ratio. We also have to make sure we get the same proportion of true/false values for the dependent variable in both the training and test sets (a stratified split).
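A sketch of the stratified split, using synthetic stand-in data with roughly the same ~9% positive rate as the Personal_Loan target:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in data with ~9% positives, mimicking the Personal_Loan target.
X, y = make_classification(n_samples=1000, weights=[0.91], random_state=1)

# stratify=y keeps the positive/negative ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
```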

The dataset was split correctly

Check if our variables have multicollinearity:

Variance Inflation Factor: Variance inflation factors measure the inflation in the variances of the regression coefficient estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient βk is "inflated" by the existence of correlation among the predictor variables in the model.

General rule of thumb: if the VIF is 1, there is no correlation between the kth predictor and the remaining predictor variables, and hence the variance of β̂k is not inflated at all. If the VIF exceeds 5, we say there is moderate multicollinearity, and a VIF of 10 or more shows signs of high multicollinearity. But the purpose of the analysis should dictate which threshold to use.
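The definition above can be computed directly: regress each predictor on all the others and take VIF_k = 1 / (1 − R²_k). A sketch on synthetic data (the notebook likely uses statsmodels' `variance_inflation_factor`; this manual version is equivalent for illustration, and the Age/Experience toy data below is mine):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(X):
    """VIF_k = 1 / (1 - R^2_k), with R^2_k from regressing
    predictor k on all the remaining predictors."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF").sort_values(ascending=False)

# Toy data: Experience is almost a linear function of Age, Income is independent.
rng = np.random.default_rng(0)
age = rng.normal(45, 10, 500)
X = pd.DataFrame({
    "Age": age,
    "Experience": age - 22 + rng.normal(0, 1, 500),
    "Income": rng.normal(70, 20, 500),
})
print(vif_table(X))  # Age and Experience show heavily inflated VIFs
```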

Removing Multicollinearity

To remove multicollinearity

  1. One by one, drop each column that has a VIF score greater than 5.
  2. Look at the adjusted R-squared of all these models.
  3. Drop the variable whose removal causes the least change in adjusted R-squared.
  4. Check the VIF scores again.
  5. Continue until all VIF scores are under 5.

The method below to remove multicollinearity was provided to us at one of MLS sessions:

Observation: Dropping the columns with high VIF scores definitely improves the model performance, so I am going to drop the Age and Experience columns:

Finally, we are ready to fit the model and get the confusion matrix

Fitting the Logistic Regression model

Finding the coefficients

Observations on coefficients

Converting coefficients to odds
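Exponentiating a logistic-regression coefficient gives the odds ratio for a one-unit increase in that predictor. A sketch with hypothetical coefficient values (not the fitted ones from this model):

```python
import numpy as np
import pandas as pd

# Hypothetical fitted coefficients on the log-odds scale, for illustration only.
coefs = pd.Series({"Income": 0.05, "CCAvg": 0.12, "Family": 0.60})

odds = np.exp(coefs)            # odds ratio per one-unit increase
pct_change = (odds - 1) * 100   # percentage change in the odds

print(pd.DataFrame({"coef": coefs, "odds": odds, "% change in odds": pct_change}))
```

For example, exp(0.60) ≈ 1.82, so under these hypothetical coefficients each additional family member would multiply the odds of taking a loan by about 1.82.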

Observations and Insights

The model is generalizing quite well to the test set. I'm generally satisfied with the evaluation metrics for the model, but let us try to improve the metrics slightly:

Model Performance Improvement

Optimal threshold using AUC-ROC curve
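A common way to pick a threshold from the ROC curve is to maximize Youden's J statistic (TPR − FPR). A sketch on synthetic stand-in data, since the real pre-processed features aren't reproduced here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Stand-in data with ~9% positives, like the Personal_Loan target.
X, y = make_classification(n_samples=1000, weights=[0.91], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=1, stratify=y)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_tr)[:, 1]
fpr, tpr, thresholds = roc_curve(y_tr, probs)

# Youden's J = TPR - FPR; its maximizer balances the two error types.
optimal_threshold = thresholds[np.argmax(tpr - fpr)]
```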

Use Precision-Recall curve and see if we can find a better threshold
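Since the report ends up with roughly equal precision and recall, one reasonable selection rule is the threshold where the two curves cross. A sketch on the same kind of synthetic stand-in data (the notebook uses the real pre-processed features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Stand-in data with ~9% positives.
X, y = make_classification(n_samples=1000, weights=[0.91], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=1, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_tr)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_tr, probs)
# Pick the threshold where precision and recall are (nearly) equal.
idx = np.argmin(np.abs(precision[:-1] - recall[:-1]))
best_threshold = thresholds[idx]
```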

This is much better: we have approximately the same values for recall and precision, around 71%

Selecting subset of important features using Sequential Feature Selector method
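The mentored sessions commonly use mlxtend's sequential selector (which has built-in plotting); a minimal equivalent with scikit-learn's `SequentialFeatureSelector`, on synthetic stand-in data, looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Stand-in data: 10 features, only 4 of them informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=1)

# Forward selection: greedily add the feature that most improves CV recall.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction="forward",
    scoring="recall",
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```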

Benefits of feature selection:

Observation: I don't know why nothing is being plotted; the selector itself is clearly working

Look at model performance

Conclusions and Recommendations for the Logistic Regression Classifier Model

Model Building for Decision Tree

  1. Data preparation
  2. Partition the data into train and test sets.
  3. Build a CART model on the train data.
  4. Tune the model and prune the tree, if required.
  5. Test the model on the test set.

Split Data

Build Decision Tree Model

We only have 9% positive classes, so if our model marks every sample as negative, it will still get 91% accuracy; hence accuracy is not a good metric to evaluate here. We will focus on recall instead:
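A sketch of the baseline tree and the recall comparison, on synthetic stand-in data with the same ~9% positive rate; an unpruned tree typically memorizes the training set, which is the overfitting flagged below:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced stand-in data: ~9% positives, like the Personal_Loan target.
X, y = make_classification(n_samples=1000, weights=[0.91], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=1, stratify=y)

tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)

# An unpruned tree fits the training set perfectly; test recall lags behind.
print("train recall:", recall_score(y_tr, tree.predict(X_tr)))
print("test recall: ", recall_score(y_te, tree.predict(X_te)))
```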

The model seems to be overfitting a bit. Let's visualize the tree:

According to the model, Income is the most important feature. Also, the tree is way too complicated. Let's try to simplify it:

Reducing overfitting

Using GridSearch for Hyperparameter tuning of our tree model
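A sketch of the grid search, again on synthetic stand-in data; the parameter grid below is a hypothetical example, not the one used in the notebook, and it scores on recall to match the imbalanced target:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data with ~9% positives.
X, y = make_classification(n_samples=1000, weights=[0.91], random_state=1)

# Hypothetical grid; tune for recall since the classes are imbalanced.
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 10],
    "class_weight": [None, "balanced"],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    param_grid, scoring="recall", cv=5)
grid.fit(X, y)
print(grid.best_params_)
```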

Cost Complexity Pruning

Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.
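The procedure described above, sketched on synthetic stand-in data using scikit-learn's `cost_complexity_pruning_path`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data with ~9% positives.
X, y = make_classification(n_samples=1000, weights=[0.91], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=1, stratify=y)

# The pruning path gives the effective alphas at which nodes get pruned.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)
ccp_alphas = path.ccp_alphas

# Fit one tree per alpha; larger alphas yield smaller trees, and the
# largest alpha prunes everything down to the single root node.
clfs = [DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_tr, y_tr)
        for a in ccp_alphas]
node_counts = [clf.tree_.node_count for clf in clfs]
```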

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first.

Remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and the tree depth decrease as alpha increases

Accuracy vs alpha for training and testing sets

The maximum value of recall occurs at alpha ≈ 0.008. But let's try to choose the best fit:

Visualizing the Decision Tree

Observations and Insights

Let's look at the importance of features:

Conclusions and Recommendations

Recommendations for the bank